Backtransform data before mapping statistics #4194

yutannihilation · 2020-09-06T13:40:20Z

> The staged values should bypass the transformation, as they are probably already transformed

But I was wrong. The transformation that needs to be bypassed is not the one after mapping the calculated variables, but the one before mapping. Otherwise, if we do another calculation on the calculated value, the result is still wrong even if it skips the transformation at the end of Layer$map_statistic().

Here's an example. When after_stat is evaluated, the scaled value of x is 4, so x / 2 yields 2, which is placed at 4 on the sqrt scale. But the intended result of x / 2 is 8.

library(ggplot2)

d <- data.frame(value = 16)

ggplot(d) +
  geom_point(aes(stage(value, after_stat = x / 2), 0)) +
  scale_x_sqrt()  # the sqrt scale that the explanation above refers to
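The arithmetic behind this example can be checked by hand. The following is a minimal base R sketch that follows the description above; it does not call ggplot2, and the function and variable names are made up for illustration.

```r
# Square-root scale: the transformation and its inverse
sqrt_transform <- function(x) sqrt(x)
sqrt_inverse   <- function(x) x^2

value <- 16

# Without back-transformation, the after_stat expression sees the transformed x
x_scaled <- sqrt_transform(value)   # 4
halved   <- x_scaled / 2            # 2, still in transformed space
shown_at <- sqrt_inverse(halved)    # reads as 4 on the axis

# What the user expects: half of the data value itself
expected <- value / 2               # 8
```

The gap between `shown_at` and `expected` is the discrepancy this pull request is about.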

The transformation we want to bypass is this one. But the operations that come after it (i.e. training and mapping positions) depend on the scaled values, so we cannot skip this transformation.

ggplot2/R/plot-build.r

Lines 60 to 61 in 6b8dba0

    
# Transform all scales
data <- lapply(data, scales_transform_df, scales = scales)

After some thinking, I came to the conclusion that the values need to be back-transformed before evaluating after_stat. There might be a cleaner mechanism for this, but I believe this is a realistic solution at the moment...

library(ggplot2)

d <- data.frame(value = 16)

# Current behaviour

ggplot(d) +
  geom_point(aes(value / 2, 0), colour = "green", size = 10) +
  geom_point(aes(stage(value, after_stat = x / 2), 1), colour = "purple", size = 10) +
  scale_x_sqrt(limits = c(0, 16), breaks = c(2, 4, 8))

# This pull request

devtools::load_all("~/GitHub/ggplot2/")
#> ℹ Loading ggplot2

ggplot(d) +
  geom_point(aes(value / 2, 0), colour = "green", size = 10) +
  geom_point(aes(stage(value, after_stat = x / 2), 1), colour = "purple", size = 10) +
  scale_x_sqrt(limits = c(0, 16), breaks = c(2, 4, 8))

Created on 2022-05-15 by the reprex package (v2.0.1)
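The idea of back-transforming before evaluating after_stat can be sketched in base R as follows. This is not the ggplot2 implementation: `trans` is a hypothetical stand-in for a scale's transformation, and the final re-transformation step is an assumption about how later steps keep receiving scaled values.

```r
# Hypothetical stand-in for a scale's transformation (not a ggplot2 internal)
trans <- list(transform = sqrt, inverse = function(x) x^2)

# Layer data as it looks after the initial scale transformation
data <- data.frame(x = trans$transform(16))          # x is 4 in transformed space

# 1. Back-transform so the expression sees values in data units
data_orig <- data.frame(x = trans$inverse(data$x))   # x is 16 again

# 2. Evaluate the after_stat expression on the back-transformed data
new_x <- eval(quote(x / 2), data_orig)               # 8

# 3. Re-apply the transformation so later steps still get scaled values
data$x <- trans$transform(new_x)                     # sqrt(8)
```

Read back through the inverse, the point now sits at 8, which is where the green reference point in the reprex is drawn.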

clauswilke · 2020-09-06T17:24:53Z

Something makes me uncomfortable about this. At a minimum, I think this needs much more documentation to really explain the logic at each step.

thomasp85 · 2020-09-06T17:46:27Z

will this not potentially break every plot that has ever used stat() or ..aes..?

I must admit that I think it is better to document what has happened to the data up until the after_stat phase and simply let that be the end of it...

clauswilke · 2020-09-06T18:42:26Z

@thomasp85 Something like this PR may very well be needed. I've dealt with a similar problem in coords. However, I think we should do some more due diligence before merging.

It'd be a breaking change, but I think the cases where this would hit would be quite rare, since it requires a combination of using stat() and using a scale with transformation, and the latter is usually applied to positions and the former is usually not. I'm not sure I've ever made a plot that meets this criterion (except my reprex below).

I'm providing another reprex, using the current ggplot2 (without this patch), that shows that the current behavior is confusing and inconsistent with how everything else in ggplot2 works.

library(ggplot2)
library(ggridges)

df <- data.frame(x = rexp(100))

# works as expected
ggplot(df, aes(x, y = 0, fill = stat(x) < 1)) +
  geom_density_ridges_gradient()
#> Picking joint bandwidth of 0.287

# confusing result. why are almost all points green?
ggplot(df, aes(x, y = 0, fill = stat(x) < 1)) +
  geom_density_ridges_gradient() +
  scale_x_log10()
#> Picking joint bandwidth of 0.194

# we have to transform x according to the scale before things work,
# that seems strange from a user's perspective
ggplot(df, aes(x, y = 0, fill = stat(x) < 0)) +
  geom_density_ridges_gradient() +
  scale_x_log10()
#> Picking joint bandwidth of 0.194

Created on 2020-09-06 by the reprex package (v0.3.0)
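The reason the second plot is almost entirely one colour can be reproduced without any plotting. The sketch below uses raw draws rather than the density grid that the stat actually computes, and the seed is arbitrary; it only illustrates that a threshold applied to log10-transformed values means something different from the same threshold in data units.

```r
set.seed(1)   # arbitrary seed, for reproducibility only
x <- rexp(100)

# With scale_x_log10(), stat(x) holds log10(x), so `stat(x) < 1` really tests x < 10
mean(log10(x) < 1)

# The intended split at x = 1 corresponds to 0 in log10 space
all((log10(x) < 0) == (x < 1))
```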

thomasp85 · 2020-09-06T18:52:16Z

I'm not convinced, but I'm also not 100% against. Scale transformations are applied as the very first thing in the data pipeline, which is something there is value in teaching and understanding. It seems weird to me to cherry-pick parts of the data pipeline and undo them at certain points...

Again, not saying this shouldn't be done, but I'd be wary of doing this since, as you note yourself, this is an extreme edge case and the change required is not nothing...

clauswilke · 2020-09-06T19:05:57Z

Another example. This one I find even more disconcerting, because it produces a plot that is objectively wrong.

library(ggplot2)
library(ggridges)

df <- data.frame(x = rexp(100))

ggplot(df, aes(x, y = 0, fill = stat(x))) +
  geom_density_ridges_gradient() +
  scale_x_log10() +
  scale_fill_viridis_c()
#> Picking joint bandwidth of 0.171

Created on 2020-09-06 by the reprex package (v0.3.0)

thomasp85 · 2020-09-06T19:09:11Z

Well, it is only objectively wrong because we strip stat() from the guide title 😉

clauswilke · 2020-09-06T19:10:21Z

A possible compromise could be to leave after_stat() as is but add a new function that explicitly backtransforms. E.g. after_stat_bt(). Maybe somebody can come up with a better name.

thomasp85 · 2020-09-06T19:15:42Z

I'd much rather explore this (or make it an argument to after_stat()). In general I'd be very skeptical from a performance point of view to transform all aesthetics back and forth for no reason at all in 99.9% of all plot cases

yutannihilation · 2020-09-06T23:36:07Z

I too feel unhappy about this PR, and half the purpose of it is to share the discomfort with you so that you'll come up with some superseding solution, as you always do :)

> I've dealt with a similar problem in coords.

Yeah, actually it made me think back-transformation is needed here as well.

> In general I'd be very skeptical from a performance point of view to transform all aesthetics back and forth for no reason at all in 99.9% of all plot cases

Probably we can easily skip the back-transformation by checking if the trans is identity? Besides, the back-transformation only happens when there's some Stat (otherwise it returns early), so I don't think it makes up 99.9%.

Whether or not we end up adopting this, I think this is worth a breaking change, as the name after_stat() doesn't sound right, especially now that we have after_scale(), which makes users think after_stat() comes before the scale. And the actual breakage would be rare.

clauswilke · 2020-09-07T02:18:04Z

> Whether or not we end up adopting this, I think this is worth a breaking change, as the name after_stat() doesn't sound right, especially now that we have after_scale(), which makes users think after_stat() comes before the scale. And the actual breakage would be rare.

I'm not suggesting changing these names again, but I did realize today that I've been always confused by after_scale() because it probably should be called after_mapping(). The data gets transformed (some may think of this as scaling the data), then the statistical transformations are applied, and finally the data gets mapped.
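The order described here (transform, then compute statistics, then map) can be illustrated with a small base R sketch. It is only an analogy: the log10 transformation, the mean as the "statistic", and the rescaling to [0, 1] as the "mapping" are all choices made for this example, not ggplot2 internals.

```r
x <- c(1, 10, 100)

# 1. Scale transformation (what scale_x_log10() would apply first)
x_trans <- log10(x)                  # 0, 1, 2

# 2. Statistic computed on the transformed values
stat_mean <- mean(x_trans)           # 1

# 3. Mapping: rescale to the [0, 1] range used for output
mapped <- (stat_mean - min(x_trans)) / diff(range(x_trans))   # 0.5

# Back in data units the statistic is the geometric mean, not the arithmetic one
10^stat_mean   # 10
mean(x)        # 37
```

Because step 2 runs on transformed values, anything evaluated at that point is in transformed units unless it is explicitly back-transformed.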

thomasp85 · 2020-09-07T05:51:53Z

> Besides, the back-transformation only happens when there's some Stat (otherwise it returns early), so I don't think it makes up 99.9%

Sorry - I confused myself as to when this function would get called.

> I'm not suggesting changing these names again, but I did realize today that I've been always confused by after_scale() because it probably should be called after_mapping(). The data gets transformed (some may think of this as scaling the data), then the statistical transformations are applied, and finally the data gets mapped.

I'm sure after_mapping() would be more confusing since we use the mapping term for a different purpose in the API (assigning data to aesthetics). I feel after_scale() is correct. This step happens after all scaling operations (transformation, censoring, mapping etc.) have completed. I personally don't think these names in any way imply that all operations in the scale or the stat happen in one chunk, which is where I think all this confusion comes from.

thomasp85 · 2020-09-07T05:54:24Z

@clauswilke what kind of related issues have you faced in coords?

clauswilke · 2020-09-07T06:16:57Z

For geoms such as geom_hline() or geom_vline(), I needed to backtransform the range to make these geoms work correctly:

ggplot2/R/geom-hline.r

Lines 46 to 48 in ac2b5a7

    
GeomHline <- ggproto("GeomHline", Geom,
  draw_panel = function(data, panel_params, coord) {
    ranges <- coord$backtransform_range(panel_params)

Getting this all right was quite tricky, because in many places the code wasn't very clear about whether it needed a regular range or a backtransformed range. It took quite a while to disentangle all of this. I tried to document this here:

ggplot2/R/coord-.r

Lines 16 to 30 in ac2b5a7

    
#'   - `backtransform_range(panel_params)`: Extracts the panel range provided
#'     in `panel_params` (created by `setup_panel_params()`, see below) and
#'     back-transforms to data coordinates. This back-transformation can be needed
#'     for coords such as `coord_trans()` where the range in the transformed
#'     coordinates differs from the range in the untransformed coordinates. Returns
#'     a list of two ranges, `x` and `y`, and these correspond to the variables
#'     mapped to the `x` and `y` aesthetics, even for coords such as `coord_flip()`
#'     where the `x` aesthetic is shown along the y direction and vice versa.
#'   - `range(panel_params)`: Extracts the panel range provided
#'     in `panel_params` (created by `setup_panel_params()`, see below) and
#'     returns it. Unlike `backtransform_range()`, this function does not perform
#'     any back-transformation and instead returns final transformed coordinates. Returns
#'     a list of two ranges, `x` and `y`, and these correspond to the variables
#'     mapped to the `x` and `y` aesthetics, even for coords such as `coord_flip()`
#'     where the `x` aesthetic is shown along the y direction and vice versa.
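The difference between the two documented methods can be shown with a toy example. The following base R sketch assumes a log10 transformation such as `coord_trans(x = "log10")` might use; `log10_trans`, `range_sketch()` and `backtransform_range_sketch()` are hypothetical names for illustration and are not the ggproto methods themselves.

```r
# Hypothetical transformation pair for a log10-transformed axis
log10_trans <- list(transform = log10, inverse = function(x) 10^x)

# Panel range as stored in transformed coordinates
panel_range <- c(0, 2)

# Analogue of range(): return the transformed coordinates unchanged
range_sketch <- function(r) r

# Analogue of backtransform_range(): convert back to data coordinates
backtransform_range_sketch <- function(r, trans) trans$inverse(r)

range_sketch(panel_range)                              # 0 2
backtransform_range_sketch(panel_range, log10_trans)   # 1 100
```

A geom that draws a line across the whole panel needs the second form, since its endpoints are expressed in data coordinates before the coord transforms them.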

yutannihilation · 2020-09-07T15:04:55Z

> I'm sure after_mapping() would be more confusing since we use the mapping term for a different purpose in the API (assigning data to aesthetics). I feel after_scale() is correct.

I agree with this part. While the word "scale" might have different meanings in different places in ggplot2, it probably still has a clearer meaning than "mapping," which is too general.


thomasp85 · 2022-05-11T11:22:14Z

Team - should we revive this for the upcoming release? I don't think my stance has changed all that much, but I'd be happy to consider an argument — I don't think a new function is the correct way because there would be no way to port this over to stage()

clauswilke · 2022-05-11T16:23:55Z

I just reread the entire thread and I'm not sure I have a useful opinion either way. Maybe it would be a good idea to bring in @teunbrand, since he filed the original issue that brought this up.

If we want to go forward with this, I think it needs an empirical approach. Implement a version of the suggested idea and see if it breaks things or has performance implications.

teunbrand · 2022-05-11T17:57:21Z

I do have a preference that the stage() family of functions would work as one would intuitively expect. I must admit I'm not 100% familiar with all design decisions involved with map_statistic, but I have an idea that I can work out in a PR if we're to discuss implementation details.


teunbrand · 2022-05-15T08:37:45Z

R/layer.r

@@ -299,6 +299,9 @@ Layer <- ggproto("Layer", NULL,
    # evaluation (since the evaluation symbols gets renamed)
    data <- rename_aes(data)

+    # data needs to be non-scaled
+    data_orig <- scales_backtransform_df(plot$scales, data)


I think you might be able to alleviate Thomas' concern that this is applied to 99% of plots, by executing this line after the early exit at the if (length(new) == 0) return(data) line.

yutannihilation · 2022-05-15

Agreed, thanks!

teunbrand · 2022-05-15T09:09:13Z

I think we might even further reduce the impact of this PR on performance if we selectively backtransform the relevant columns, and leave other columns as-is.

# Suppose these are the aesthetics to evaluate
new <- aes(x = after_stat(y + 1), colour = 2, fill = Species)

# Extract all variables occurring in the aesthetics
vars <- unlist(lapply(new, all.vars), use.names = FALSE)

# Only backtransform variables that occur
data_orig <- scales_backtransform_df(scales, df[intersect(names(df), vars)])

Together with Hiroaki's suggestion earlier:

> Probably we can easily skip the back-transformation by checking if the trans is identity?

I think the performance hit might be contained to a minimum.

yutannihilation · 2022-05-15T13:51:27Z

Thanks!

vars <- unlist(lapply(new, all.vars), use.names = FALSE)

I think this works in most cases, but there's a chance that some variable is referenced indirectly (e.g. get("x")). While that's probably not good practice, I think we should not limit the variables, for safety.
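The caveat about indirect references can be seen directly with base R's `all.vars()`. The sketch below uses plain quoted expressions in a list instead of `aes()`, so that it runs without ggplot2; that substitution is an assumption made for the example.

```r
# Expressions standing in for the aesthetics to evaluate
new <- list(
  x      = quote(y + 1),
  colour = quote(2),
  fill   = quote(Species)
)

# Direct references are found
vars <- unlist(lapply(new, all.vars), use.names = FALSE)
vars                             # "y" "Species"

# An indirect reference is not: "x" is a string here, not a symbol
all.vars(quote(get("x") / 2))    # character(0)
```

Selecting columns by `vars` would therefore leave `x` untransformed in the second case, which is the safety concern raised above.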

yutannihilation · 2022-05-15T14:52:57Z

If the concern is the performance, I think I addressed it. scales_transform_df() can be left as is, but I added the same workaround to skip non-transforming trans for consistency.
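Skipping non-transforming scales could look roughly like the following. This is a sketch under stated assumptions, not the code added in this PR: the transformation objects are plain lists with a `name` field, and `backtransform_sketch()` is a hypothetical helper.

```r
# Hypothetical transformation objects; `name` marks the identity case
identity_trans_sketch <- list(name = "identity", inverse = function(x) x)
sqrt_trans_sketch     <- list(name = "sqrt",     inverse = function(x) x^2)

# Back-transform only the columns whose transformation does something
backtransform_sketch <- function(df, trans_by_col) {
  for (col in intersect(names(df), names(trans_by_col))) {
    trans <- trans_by_col[[col]]
    # Nothing to undo when there is no trans or it is the identity
    if (is.null(trans) || identical(trans$name, "identity")) next
    df[[col]] <- trans$inverse(df[[col]])
  }
  df
}

df  <- data.frame(x = 4, y = 3)
out <- backtransform_sketch(df, list(x = sqrt_trans_sketch, y = identity_trans_sketch))
out   # x = 16, y = 3
```

With such a check, plots that use only identity scales pay for little more than the loop over columns.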


thomasp85 · 2022-05-18T09:29:03Z

@yutannihilation is this ready for review?

yutannihilation · 2022-05-18T13:05:30Z

Yes.

thomasp85

Can you add a news bullet indicating that this is a breaking change (though quite niche)?

otherwise it's good to merge

yutannihilation · 2022-05-21T05:01:13Z

Thanks! Let me find some good explanation for NEWS.


yutannihilation · 2022-06-13T14:40:24Z

Added a test and a NEWS item.


Backtransform data before mapping statistics

e05cdca

Merge remote-tracking branch 'upstream/master' into fix/issue-4155-ba…

1ada56c

…cktransform-data

thomasp85 added this to the ggplot2 3.4.0 milestone Mar 25, 2021

yjunechoe mentioned this pull request Oct 17, 2021

ggtrace_aes wrapper yjunechoe/ggtrace#20

Closed

teunbrand mentioned this pull request May 11, 2022

No retransform in mapping statistics #4835

Closed

Merge remote-tracking branch 'upstream/main' into fix/issue-4155-back…

c6dfbd2

…transform-data

teunbrand reviewed May 15, 2022


Push scales_backtransform_df() to below

4d9da99

yutannihilation added 3 commits May 15, 2022 23:04

Exit early when the scale doesn't have any trans

26c7456

Skip transformation when no trans or the trans is identity

cc333d0

Remove mistakenly added browser()

656f334

Merge remote-tracking branch 'upstream/main' into fix/issue-4155-back…

4d7d269

…transform-data

thomasp85 approved these changes May 20, 2022


yutannihilation added 3 commits June 13, 2022 22:51

Merge remote-tracking branch 'upstream/main' into fix/issue-4155-back…

e2d2e4d

…transform-data

Add a test

d904bb3

Add a NEWS item

ee6b6dc

yutannihilation added 3 commits June 16, 2022 23:31

Merge remote-tracking branch 'upstream/main' into fix/issue-4155-back…

03b739c

…transform-data

Tweak

232e205

Try clearing cache

e450ddf

yutannihilation merged commit 09ef058 into tidyverse:main Jun 16, 2022

yutannihilation deleted the fix/issue-4155-backtransform-data branch June 16, 2022 15:38

teunbrand mentioned this pull request Dec 22, 2022

Bug in ggplot2 3.4.0 using scales::probability_trans #5112

Closed
